Shotgun Metagenomic Data Analysis ◾ 309
-1 fastq_pure/ERR1823608_pure_R1.fastq.gz \
-2 fastq_pure/ERR1823608_pure_R2.fastq.gz \
-0 /dev/null -s /dev/null -n
The saved FASTQ files contain the reads of the metagenomic data after removing the con-
taminating human sequences.
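Since each FASTQ record occupies exactly four lines, you can sanity-check the filtered files by counting reads directly: the number of reads equals the line count divided by four. A minimal sketch on a toy file (on the real data you would pass fastq_pure/ERR1823608_pure_R1.fastq.gz instead of the toy file):

```shell
# Count reads in a gzipped FASTQ: one record = four lines,
# so reads = total lines / 4.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' | gzip > toy.fastq.gz
echo $(( $(gzip -cd toy.fastq.gz | wc -l) / 4 ))   # prints the read count: 2
rm toy.fastq.gz
```

Running the same command on both the R1 and R2 files should give identical counts; a mismatch means the pairing has been broken.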
If you run FastQC on those files, you will find that the read lengths vary between 35
and 151 bp. We can remove any read pair in which either read is shorter than 50 bp.
Removing such reads from paired-end FASTQ files requires a program or a script that
removes the reads from both files of a pair without leaving singletons, which may be
rejected by some programs used in the downstream analysis. Thus, we need to filter reads
by length using the right tool. Several programs can do this, but here we will use a bash
script. First, change into "fastq_pure" and run the following loop to decompress the
FASTQ files:
cd fastq_pure
for i in *.gz;
do
gzip -d ${i}
done
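Before writing the filtering script, you can verify the read-length range yourself; the awk one-liner below reproduces the minimum and maximum that FastQC reports. A minimal sketch on a toy FASTQ (for the real data, substitute one of the decompressed ERR1823608 files):

```shell
# Report the minimum and maximum read length in a FASTQ file.
# NR%4==2 selects the sequence line of each four-line record.
printf '@r1\nACGTACG\n+\nIIIIIII\n@r2\nACG\n+\nIII\n' > toy.fastq
awk 'NR%4==2 { n=length($0);
               if (min=="" || n<min) min=n;
               if (n>max) max=n }
     END { print "min="min, "max="max }' toy.fastq   # prints: min=3 max=7
rm toy.fastq
```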
Then, open a text editor of your choice, such as vim or nano, and save the following bash
script in a file named "remove_PE.sh":
vim remove_PE.sh
Then, copy the bash script into the file, save it, and exit:
#!/bin/sh
#Use: remove_PE.sh R1.fastq R2.fastq 80
#1. Start with inputs
fq_r1=$1
fq_r2=$2
minLength=$3
#2. Find all entries with read length less than minimum length and
#   print line numbers, for both R1 and R2
awk -v min=$minLength \
'{if(NR%4==2) \
if(length($0)<min) \
print NR"\n"NR-1"\n"NR+1"\n"NR+2}' \
$fq_r1 > temp.lines1
awk -v min=$minLength \
'{if(NR%4==2) \
if(length($0)<min) \
print NR"\n"NR-1"\n"NR+1"\n"NR+2}' \
$fq_r2 >> temp.lines1
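To see what this awk step emits, consider a toy run (the file toy.fastq and the 5 bp cutoff below are illustrative only, not part of the script):

```shell
# Demonstrate the line-number logic: for each sequence line (NR%4==2)
# shorter than min, print the line numbers of all four lines of that record.
printf '@r1\nACGTACG\n+\nIIIIIII\n@r2\nACG\n+\nIII\n' > toy.fastq
awk -v min=5 '{if(NR%4==2) if(length($0)<min) \
print NR"\n"NR-1"\n"NR+1"\n"NR+2}' toy.fastq
# The second record's sequence sits on line 6 and is only 3 bp long,
# so the output is the line numbers 6, 5, 7, 8 (that whole record).
rm toy.fastq
```

Because R1 and R2 have matching record positions, appending both files' line numbers into one list lets the script later delete the same records from both files, avoiding singletons.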